Abstract

Understanding how crossover works is still one of the big challenges in evolutionary
computation research, and making our understanding precise and proven by mathematical
means might be an even bigger one. As one of few examples where crossover provably is
useful, the $(1 + (\lambda, \lambda))$ Genetic Algorithm (GA) was proposed recently in [Doerr, Doerr, Ebel.
Lessons From the Black-Box: Fast Crossover-Based Genetic Algorithms. TCS 2015]. Using
the fitness level method, the expected optimization time on general OneMax functions was
analyzed and an $O(\max\{n \log(n)/\lambda, \lambda n\})$ bound was proven for any offspring population size
$\lambda \in [1..n]$.

We improve this work in several ways, leading to sharper bounds and a better
understanding of how the use of crossover speeds up the runtime in this algorithm. We first
improve the upper bound on the runtime to $O(\max\{n \log(n)/\lambda, n\lambda \log\log(\lambda)/\log(\lambda)\})$. This
improvement is made possible by observing that in the parallel generation of $\lambda$ offspring
via crossover (but not mutation), the best of these often is better than the expected value,
and hence several fitness levels can be gained in one iteration.

We then present the first lower bound for this problem. It matches our upper bound for all
values of $\lambda$. This allows us to determine the asymptotically optimal value for the population
size. It is $\lambda = \Theta(\sqrt{\log(n) \log\log(n)/\log\log\log(n)})$, which gives an optimization time of
$\Theta(n \sqrt{\log(n) \log\log\log(n)/\log\log(n)})$. Hence the improved runtime analysis both gives a
runtime guarantee improved by a super-constant factor and yields a better actual runtime
(faster by more than a constant factor) by suggesting a better value for the parameter $\lambda$.

We finally give a tail bound for the upper tail of the runtime distribution, which shows
that the actual runtime exceeds our runtime guarantee by a factor of $(1 + \delta)$ with probability
$O((n/\lambda^2)^{-\delta})$ only.

1 Introduction 

The role of crossover in evolutionary computation is still not very well understood. On the
one hand, it is used intensively in practice; on the other hand, few rigorous theoretical or
experimental investigations clearly support the usefulness of crossover. The difficulties both in
experimentally supporting early explanation models like the building block hypothesis [MHF93]
and in constructing artificial example functions (e.g. [JW02]) such that simple evolutionary
algorithms perform better with crossover than without (note that this would still only be a very
weak support for the concept of crossover) rather suggest that crossover is not that easily
employed with success. In the meantime, a few less artificial examples were found where crossover
provably leads to a runtime having a smaller asymptotic order of magnitude, namely a simplified
variant of the Ising model on rings [FW04] and on trees [Sud05], as well as a series of works on
the all-pairs shortest path problem ([DHK12], [DT09], and [DJK+13]).

In this work, we build on the latest algorithm where crossover was proven to be useful, the
$(1 + (\lambda, \lambda))$ GA proposed in [DDE15]. Unlike in most previous works, here crossover yields a
super-constant speed-up even for very simple functions like the OneMax test function. However,
experiments show that the algorithm performs well also on linear functions, royal road functions,
and maximum satisfiability instances [GP14] (in fact, the latter work shows that the
$(1 + (\lambda, \lambda))$ GA outperforms hill climbers for several problems; for MaxSat it also outperforms the
Linkage Tree Genetic Algorithm). The $(1 + (\lambda, \lambda))$ GA uses crossover in a different way than the
previous works. Instead of trying to combine particularly fit solution parts, here a biased uniform
crossover is used as a repair mechanism. Having such a crossover-based repair mechanism, we
can use mutation with a higher rate, leading to a faster exploration of the search space.

Given this different working principle not seen before in discrete evolutionary optimization,
there is a strong motivation to gain a deeper understanding of the $(1 + (\lambda, \lambda))$ GA and its working
principles. While in the first analysis of the $(1 + (\lambda, \lambda))$ GA on the OneMax test function
$\mathrm{Om} : \{0,1\}^n \to \{0,1,\ldots,n\}; x \mapsto \sum_{i=1}^n x_i$ only an upper bound of $O(\max\{n \log(n)/\lambda, \lambda n\})$ for
the expected number of fitness evaluations was shown, we now determine the precise expected
runtime to be $\Theta(\max\{n \log(n)/\lambda, n\lambda \log\log(\lambda)/\log(\lambda)\})$ for all values of $\lambda \in [1..n]$. We thus
both improve the upper bound and give the first lower bound, which in addition matches the
upper bound.

We further prove a strong concentration result for the runtime, showing that deviations
above the expectation are unlikely: For all $\delta > 0$, the probability that the actual runtime
exceeds a bound as above by a factor of $(1 + \delta)$ is at most $O((n/\lambda^2)^{-\delta})$.

Our result on the expected runtime has two implications beyond showing that the
$(1 + (\lambda, \lambda))$ GA is slightly (but still by a super-constant factor) faster than what the previous work
guaranteed. (i) The previous upper bound gave the best runtime guarantee for $\lambda = \Theta(\sqrt{\log n})$,
namely $O(n\sqrt{\log n})$ with the old bound (and $\Theta(n\sqrt{\log n})$ with our new bound). From our
new and sharp runtime estimate, however, we derive a better value for $\lambda$ (together with the
guarantee that it is asymptotically optimal), namely $\lambda = \Theta(\sqrt{\log(n) \log\log(n)/\log\log\log(n)})$.
This gives an optimization time of $\Theta(n \sqrt{\log(n) \log\log\log(n)/\log\log(n)})$. Hence we also improve
the performance by determining a better value for the parameter $\lambda$. (ii) Our analysis leading to
these results also gives the desired additional insights into the working principles of this crossover
operator and its interplay with mutation. The improved runtime guarantee is based on the
observation that when generating $\lambda$ offspring in parallel, some have a fitness significantly better
than expected. We exploit this to show that sufficiently often we gain sufficiently many fitness
levels in one generation. Interestingly, the good runtimes shown for the $(1 + (\lambda, \lambda))$ GA only
stem from better-than-expected individuals in the crossover phase, but not in the mutation
phase.

With only few runtime results on crossover-based EAs and, also, the majority of the
runtime results for EAs in general still being for evolutionary algorithms having trivial population
sizes, we feel that our work also advances the state of the art in terms of analysis methods. Our
argument that one out of $\lambda$ offspring can have a significantly better fitness than the expected
fitness of one offspring resembles a similar one made by Jansen, De Jong, and Wegener [JDW05],
who used multiple fitness level gains to prove that for the $(1 + \lambda)$ EA optimizing OneMax a
linear speed-up (compared to $\lambda = 1$) exists if and only if $\lambda = O(\log(n) \log\log(n)/\log\log\log(n))$.
Note that this result is different from ours in two respects, namely in that it does not show
a positive influence of $\lambda$ on the optimization time (number of fitness evaluations), but only on
the number of generations, and in that there the better-than-expected offspring is generated
by mutation, whereas in our setting it is generated by crossover. A further difference is that the random
experiment producing the new generation for the $(1 + (\lambda, \lambda))$ GA has stochastic dependencies
that are not present in the $(1 + \lambda)$ EA, which require different arguments (e.g., a balls-into-bins
argument).

Algorithm 1: The $(1 + (\lambda, \lambda))$ GA from [DDE13] with offspring population size $\lambda$,
mutation probability $p$, and crossover bias $c$. The standard choices for the latter two parameters
are $p = \lambda/n$ and $c = 1/\lambda$. The mutation operator $\mathrm{mut}_\ell$ generates an offspring from one
parent by flipping exactly $\ell$ random bits (without replacement). The crossover operator
$\mathrm{cross}_c$ performs a biased uniform crossover, taking bits independently with probability $c$
from the second argument and with probability $1 - c$ from the first argument.

  Initialization: Choose $x \in \{0,1\}^n$ uniformly at random and evaluate $f(x)$;
  Optimization: for $t = 1, 2, 3, \ldots$ do
    Mutation phase: Sample $\ell$ from $B(n, p)$;
    for $i = 1, \ldots, \lambda$ do
      $x^{(i)} \leftarrow \mathrm{mut}_\ell(x)$ and evaluate $f(x^{(i)})$;
    Choose $x' \in \{x^{(1)}, \ldots, x^{(\lambda)}\}$ with $f(x') = \max\{f(x^{(1)}), \ldots, f(x^{(\lambda)})\}$ u.a.r.;
    Crossover phase: for $i = 1, \ldots, \lambda$ do
      $y^{(i)} \leftarrow \mathrm{cross}_c(x, x')$ and evaluate $f(y^{(i)})$;
    Choose $y \in \{y^{(1)}, \ldots, y^{(\lambda)}\}$ with $f(y) = \max\{f(y^{(1)}), \ldots, f(y^{(\lambda)})\}$ u.a.r.;
    Selection step: if $f(y) \geq f(x)$ then $x \leftarrow y$;

2 The $(1 + (\lambda, \lambda))$ GA

The $(1 + (\lambda, \lambda))$ GA is a simple evolutionary algorithm using crossover. Following [DDE13,
DDE15] we present it here for the maximization of pseudo-Boolean functions $f : \{0,1\}^n \to \mathbb{R}$.
Its pseudo-code is given in Algorithm 1.

The $(1 + (\lambda, \lambda))$ GA is initialized with a solution candidate drawn uniformly at random
from $\{0,1\}^n$. It then proceeds in iterations (rounds) consisting of a mutation, a crossover, and
a selection phase. In an important contrast to many other genetic algorithms, the mutation
phase precedes the crossover phase. This allows us to use crossover as a repair mechanism, as we
shall discuss in more detail below.

In the mutation phase of the $(1 + (\lambda, \lambda))$ GA, we create $\lambda$ offspring from the current-best
solution $x$ by applying to it the mutation operator $\mathrm{mut}_\ell(\cdot)$, which samples $\ell$ (different) positions
uniformly at random and generates a new bit string from the input by flipping the bits in these $\ell$
positions. That is, $\mathrm{mut}_\ell(x)$ is a bit string in which for $\ell$ random positions $i$ the entry $x_i \in \{0,1\}$ is
replaced by $1 - x_i$. The step size $\ell$ is chosen randomly according to a binomial distribution $B(n, p)$
with $n$ trials and success probability $p$; in our analyses we follow the suggestion in [DDE15] and
use $p = \lambda/n$. The expected distance of a random offspring to $x$ is thus $\lambda$. To ensure that all
mutants have the same distance from the parent $x$, the same $\ell$ is used for all $\lambda$ offspring. The
fitness of the $\lambda$ offspring is computed and the best one of them, $x'$, is selected to take part in
the crossover phase. If there are several offspring having maximal fitness, we pick one of them
uniformly at random (u.a.r.).

When $x$ is already close to an optimal solution, the offspring created in the mutation phase
are typically all of much worse fitness than $x$. Our hope is, though, that they have discovered
some parts of the optimal solution that are not yet reflected in $x$. In order to preserve these
parts while at the same time not destroying the good parts of $x$, the $(1 + (\lambda, \lambda))$ GA creates
in the crossover phase $\lambda$ offspring from $x$ and $x'$. Each one of these offspring is sampled
from a uniform crossover with bias $c$ to take an entry from $x'$. That is, each of the offspring
$\mathrm{cross}_c(x, x')$ is created by taking independently for each position $i$ the entry $x'_i$ with probability
$c$ and taking the entry $x_i$ otherwise. Again we follow the suggestion from [DDE15] and
use $c = 1/\lambda$. With this choice, we try to ensure that the new individual $y$ is close to $x$ even
when the mutation rate is high. Note that $\mathrm{cross}_c(x, \mathrm{mut}_\ell(x))$ is just the outcome of applying
to $x$ standard-bit mutation with mutation probability $1/n$. Of course, due to the intermediate
selection of $x'$, our new individual $y$ follows a more complicated distribution (which is the reason
for the performance of our algorithm). Again we evaluate the fitness of the $\lambda$ new offspring and
select the best one of them, which we denote by $y$. If there are several offspring of maximal
fitness, we simply take one of them uniformly at random.¹

Finally, in the selection step the current-best solution x is replaced by its offspring y if and 
only if the fitness of y is at least as good as the one of x. 

As is common in the runtime analysis community, we do not specify a termination criterion.
The simple reason is that we study as a theoretical performance measure the expected number
of function evaluations that the $(1 + (\lambda, \lambda))$ GA performs until it evaluates for the first time a
search point of maximal fitness (the so-called optimization time). Of course, for an application
to a real problem a termination criterion has to be specified.
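The description above can be turned into a short program. The following is a minimal Python sketch, not the authors' implementation. It assumes the standard parameters $p = \lambda/n$ and $c = 1/\lambda$, and it adds a termination criterion (a known target fitness, by default $n$ as for OneMax, plus a hypothetical evaluation budget `max_evals`) that the algorithm itself does not specify. All function names are illustrative.

```python
import random

def onemax(x):
    """OneMax: number of one-bits in the bit list x."""
    return sum(x)

def mutate(x, ell, rng):
    """mut_ell: flip exactly ell distinct, uniformly chosen positions."""
    y = list(x)
    for i in rng.sample(range(len(x)), ell):
        y[i] = 1 - y[i]
    return y

def crossover(x, x_prime, c, rng):
    """cross_c: take each bit from x_prime with probability c, else from x."""
    return [b if rng.random() < c else a for a, b in zip(x, x_prime)]

def best_uar(population, f, rng):
    """Return a uniformly random individual of maximal fitness and its fitness."""
    fitness = [f(z) for z in population]
    best = max(fitness)
    winners = [z for z, v in zip(population, fitness) if v == best]
    return rng.choice(winners), best

def one_plus_lambda_lambda_ga(n, lam, f=onemax, target=None, seed=0, max_evals=10**7):
    """Run the GA until fitness `target` is reached (assumed stopping rule).

    Returns (number of fitness evaluations, final search point). Whole
    iterations are counted, so the count can slightly exceed the
    optimization time as defined in the text.
    """
    rng = random.Random(seed)
    target = n if target is None else target
    p, c = lam / n, 1.0 / lam
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals = f(x), 1
    while fx < target and evals < max_evals:
        # Mutation phase: one step size ell ~ B(n, p) shared by all mutants.
        ell = sum(rng.random() < p for _ in range(n))
        x_prime, _ = best_uar([mutate(x, ell, rng) for _ in range(lam)], f, rng)
        # Crossover phase: biased uniform crossover between x and the winner x'.
        y, fy = best_uar([crossover(x, x_prime, c, rng) for _ in range(lam)], f, rng)
        evals += 2 * lam
        # Selection step: accept y if it is at least as good as x.
        if fy >= fx:
            x, fx = y, fy
    return evals, x
```

For example, `one_plus_lambda_lambda_ga(50, 4, seed=1)` runs the sketch on OneMax with $n = 50$ and $\lambda = 4$ and returns the evaluation count together with the all-ones string.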

3 Runtime Analysis 

Runtime analysis is one of the most successful theoretical tools to understand the
performance of evolutionary algorithms. The runtime or optimization time of an algorithm (e.g.,
our $(1 + (\lambda, \lambda))$ GA) on a problem instance (e.g., the OneMax function) is the number of
fitness evaluations that are performed until for the first time an optimal solution is evaluated.

If the algorithm is randomized (like our $(1 + (\lambda, \lambda))$ GA), this is a random variable $T$, and
we usually make statements on the expected value $E[T]$ or give bounds that hold with some
high probability, e.g., $1 - 1/n$. When regarding a problem with more than one instance (e.g.,
traveling salesman instances on $n$ cities), we take a worst-case view. That is, we regard the
maximum expected runtime over all instances, or we make statements like that the runtime
satisfies a certain bound for all instances.

In this work, the optimization problem we regard is the classic OneMax test problem
consisting of the single instance $\mathrm{Om} : \{0,1\}^n \to \{0,1,\ldots,n\}; x \mapsto \sum_{i=1}^n x_i$, that is, maximizing
the number of ones in a bit-string. Despite the simplicity of the OneMax problem, analyzing
randomized search heuristics on this function has spurred much of the progress in the theory
of evolutionary computation in the last 20 years, as is documented, e.g., in the recent
textbook [Jan13].

Of course, when regarding the performance on a single test instance, we should ensure
that the algorithm does not exploit the fact that there is only one instance. A counter-example
would be the algorithm that simply evaluates and outputs $x^* = (1, \ldots, 1)$, giving a perfect
runtime of 1. One way of ensuring this is that we restrict ourselves to unbiased algorithms
(see [LW12]), which treat bit-positions and bit-values in a symmetric fashion. Consequently,
an unbiased algorithm for the OneMax problem has the same performance on all problems
with isomorphic fitness landscape, in particular, on all (generalized) OneMax functions
$\mathrm{Om}_z : \{0,1\}^n \to \{0,1,\ldots,n\}; x \mapsto \mathrm{eq}(x, z)$ for $z \in \{0,1\}^n$, where $\mathrm{eq}(x, z)$ denotes the number of
bit-positions in which $x$ and $z$ agree. It is easy to see that the $(1 + (\lambda, \lambda))$ GA is unbiased (for all
parameter settings). We will henceforth not stress this fact anymore. Any other algorithms we
will discuss are also all unbiased.

¹ In [DDE13, Section 4.4] and [DDE15] a different selection rule is suggested for the crossover phase, which
is more suitable for functions with large plateaus of the same fitness value. Since we consider in this work only
the OneMax function, for which both algorithms are identical by symmetry reasons, we refrain from stating in
Algorithm 1 the slightly more complicated version proposed there, which selects the parent solution $x$ only if
there is no offspring $\neq x$ of fitness value at least as good as the one of $x$.
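To make the generalized OneMax functions concrete, here is a small Python sketch (the helper names `eq`, `om_z` and `xor` are illustrative, not taken from the paper). It also illustrates why these landscapes are isomorphic: flipping exactly the positions in which $z$ is zero turns $\mathrm{Om}_z$ into plain OneMax.

```python
def eq(x, z):
    """Number of bit-positions in which the bit lists x and z agree."""
    return sum(a == b for a, b in zip(x, z))

def om_z(z):
    """Generalized OneMax function Om_z(x) = eq(x, z) with hidden target z."""
    return lambda x: eq(x, z)

def xor(x, mask):
    """Flip the bits of x at the positions where mask is 1."""
    return [a ^ b for a, b in zip(x, mask)]

# Om_z(x) equals OneMax applied to x with the zero-positions of z flipped.
x = [1, 0, 1, 1, 0]
z = [0, 0, 1, 0, 1]
flip_mask = [1 - b for b in z]
assert om_z(z)(x) == sum(xor(x, flip_mask))
```

An unbiased algorithm behaves identically under such a relabelling of bit-values, which is why it suffices to analyze the single function $\mathrm{Om} = \mathrm{Om}_{(1,\ldots,1)}$.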

The result we build on in this work is the runtime analysis of the $(1 + (\lambda, \lambda))$ GA for various
parameter settings on the OneMax problem in [DDE15]. This analysis suggested, in particular,
certain settings for the parameters to which we will restrict ourselves in the following. For these
settings, the following upper bound for the expected runtime was proven (Theorem 4 in [DDE15]
for $\lambda = k$; note that the case $\lambda = 1$ excluded there is trivial for $\lambda = k$ since in this case, the
$(1 + (\lambda, \lambda))$ GA imitates the $(1 + 1)$ EA).


Theorem 1 ([DDE15]). Let $\lambda \in [1..n]$, possibly depending on $n$. The expected optimization
time of the $(1 + (\lambda, \lambda))$ GA with mutation probability $p = \lambda/n$ and crossover bias $c = 1/\lambda$ on
OneMax is

\[ O\left(\max\left\{\frac{n \log n}{\lambda}, \lambda n\right\}\right). \]

In particular, for $\lambda = \Theta(\sqrt{\log n})$, the expected optimization time is of order at most $n\sqrt{\log n}$.


The main result of this work is the following improvement and strengthening of the previous 
result. 


Theorem 2 (our main result). Let $\lambda \in [1..n]$. The expected optimization time of the
$(1 + (\lambda, \lambda))$ GA with mutation probability $p = \lambda/n$ and crossover bias $c = 1/\lambda$ on the OneMax test
function is

\[ \Theta\left(\max\left\{\frac{n \log(n)}{\lambda}, \frac{n \lambda \log\log(\lambda)}{\log(\lambda)}\right\}\right). \]

For all $\delta > 0$, the probability that the actual runtime exceeds a bound of this magnitude by a
factor of more than $(1 + \delta)$ is at most $O((n/\lambda^2)^{-\delta})$.

The expected runtime is minimized by the parameter choice
$\lambda = \Theta(\sqrt{\log(n) \log\log(n)/\log\log\log(n)})$. This yields an expected optimization time of
$\Theta(n \sqrt{\log(n) \log\log\log(n)/\log\log(n)})$.

4 Notation and Technical Tools 

In this section, besides fixing some very elementary notation, we collect the main technical tools 
we shall use. Mostly, these are large deviation bounds of various types. For the convenience
of the reader, we first state the known ones. We then prove a tail bound for sums of geometric 
random variables with expectations bounded from above by the reciprocals of the first positive 
integers. We finally state the well-known additive drift theorem. 


4.1 Notation 

As the reader has experienced already, we write $[a..b]$ to denote the set $\{z \in \mathbb{Z} \mid a \le z \le b\}$ of
integers between $a$ and $b$. We also use the short-hand $[a] := [1..a]$. We write $\log(n)$ to denote
the binary logarithm of $n$ and $\ln(n)$ to denote the natural logarithm of $n$. However, to avoid
unnecessary case distinctions, we define $\log(n) := 1$ for all $n < 2$ and $\ln(n) := 1$ for all $n < e$.


4.2 Known Chernoff Bounds

The following large deviation bounds are well-known and can be found, e.g., in [Doe11]. We
call all these bounds Chernoff bounds (CB) despite the fact that it is now known that some
have been found earlier by other researchers.

Theorem 3 (classic Chernoff bounds). Let $X_1, \ldots, X_n$ be independent random variables taking
values in $[0,1]$. Let $X = \sum_{i=1}^n X_i$.

(a) Let $\delta \ge 0$. Then

\[ \Pr[X \ge (1 + \delta) E[X]] \le \left(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^{E[X]}. \]

(b) Let $\delta \in [0,1]$. Then

\[ \Pr[X \ge (1 + \delta) E[X]] \le \exp(-\delta^2 E[X]/3). \]

(c) Let $\delta \in [0,1]$. Then

\[ \Pr[X \le (1 - \delta) E[X]] \le \exp(-\delta^2 E[X]/2). \]
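As a sanity check of the upper-tail bounds in the standard forms $(e^{\delta}/(1+\delta)^{1+\delta})^{E[X]}$ and $\exp(-\delta^2 E[X]/3)$, the following Python sketch compares them with the exact upper tail of a binomial distribution (a sum of independent 0/1 variables). The function names and the example parameters are illustrative choices.

```python
import math

def binomial_upper_tail(n, p, k):
    """Exact Pr[X >= k] for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def chernoff_a(mu, delta):
    """Bound (e^delta / (1 + delta)^(1 + delta))^mu on Pr[X >= (1 + delta) mu]."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def chernoff_b(mu, delta):
    """Bound exp(-delta^2 mu / 3) on Pr[X >= (1 + delta) mu], for delta in [0, 1]."""
    return math.exp(-delta**2 * mu / 3)

n, p, delta = 100, 0.25, 0.5
mu = n * p                              # E[X] = 25
k = math.ceil((1 + delta) * mu)         # smallest integer >= 37.5
exact = binomial_upper_tail(n, p, k)
print(exact, chernoff_a(mu, delta), chernoff_b(mu, delta))
```

In this example the exact tail is smaller than bound (a), which in turn is smaller than the simpler bound (b).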

Binary random variables $X_1, \ldots, X_n$ are called negatively correlated if for all $I \subseteq [1..n]$ we
have $\Pr[\forall i \in I : X_i = 0] \le \prod_{i \in I} \Pr[X_i = 0]$ and $\Pr[\forall i \in I : X_i = 1] \le \prod_{i \in I} \Pr[X_i = 1]$.

Theorem 4 (CB, negative correlation). Let $X_1, \ldots, X_n$ be negatively correlated binary random
variables. Let $a_1, \ldots, a_n \in [0,1]$ and $X = \sum_{i=1}^n a_i X_i$. Then $X$ satisfies the Chernoff bounds
given in Theorem 3 (b) and (c).

Chernoff bounds also hold for hypergeometric distributions. Let $A$ be any set of $n$ elements.
Let $B$ be a subset of $A$ having $m$ elements. If $Y$ is a random subset of $A$ of $N$ elements (chosen
uniformly at random from all $N$-element subsets of $A$), then $X := |Y \cap B|$ has a hypergeometric
distribution with parameters $(n, N, m)$.

Theorem 5 (CB, hypergeometric distributions). If $X$ has a hypergeometric distribution with
parameters $(n, N, m)$, then $E[X] = Nm/n$ and $X$ satisfies the Chernoff bounds given in
Theorem 3.

4.3 A Chernoff Bound for Geometric Random Variables 

To prove the concentration statement in Theorem 2, we need a tail bound for the upper tail of
a sum of a sequence of independent geometric random variables having expectations that are
upper-bounded by a multiple of the harmonic series. While generally Chernoff bounds for
geometric random variables are less understood than for bounded random variables, Witt [Wit14]
proves such a bound. Witt’s bound is sufficient for our purposes. For two reasons, we nevertheless
prove the following alternative result below. (i) Our proof is a simple reduction to the well-understood
coupon collector process, and thus much simpler than Witt’s. (ii) At the same time, our proof
gives a stronger bound (for our setting, Witt’s bound on the failure probability is roughly the
fourth root of ours). Since scenarios as treated here are quite common in runtime analysis (for
example, they appear whenever the fitness level method is employed in a situation where the
probability of progress is inversely proportional to the fitness distance from the optimum), we
feel that presenting our result is justified here.

We say that $X$ has a geometric distribution with success probability $p$ if for each positive
integer $k$ we have $\Pr[X = k] = (1 - p)^{k-1} p$. For all $n \in \mathbb{N}$, let $H_n := \sum_{i=1}^n (1/i)$ denote the $n$th
Harmonic number.

Lemma 6. Let $X_1, \ldots, X_n$ be independent geometric random variables with success probabilities
$p_i$. Assume that there is a number $C \le 1$ such that $p_i \ge Ci/n$ for all $i \in [n]$. Let $X = \sum_{i=1}^n X_i$.
Then $E[X] \le (1/C) n H_n \le (1/C) n (\ln(n) + 1)$ and $\Pr[X \ge (1 + \delta)(1/C) n \ln(n)] \le n^{-\delta}$ for all
$\delta > 0$.

Proof. For $i \in [n]$, let $X'_i$ be a geometric random variable with success probability exactly
$Ci/n =: p'_i$. Let the $X'_i$ be independent. Then $X'_i$ dominates $X_i$ for all $i \in [n]$, and consequently,
$X' := \sum_{i=1}^n X'_i$ dominates $X$. Recall that a random variable $Y'$ dominates a random variable $Y$
if for all $r \in \mathbb{R}$, $\Pr[Y \ge r] \le \Pr[Y' \ge r]$. Note that this implies that $E[Y] \le E[Y']$ and that all
upper tail bounds for $Y'$ immediately carry over to $Y$. Consequently, we can conveniently argue
for $X'$ instead of $X$.

For the statement on the expectation, we recall that the expectation of a geometric random
variable with success probability $p$ is $1/p$. Consequently, by linearity of expectation, we have

\[ E[X'] = \sum_{i=1}^n E[X'_i] = \sum_{i=1}^n (1/p'_i) = (n/C) \sum_{i=1}^n (1/i) = (n/C) H_n. \]

For the tail bound, consider the following coupon collector process. There are $n$ different
types of coupons. In each round (independently), with probability $C$ we obtain a coupon
having a type chosen uniformly at random and with probability $1 - C$ we obtain nothing. We
are interested in the number $T$ of rounds until we have each type of coupon at least once. For
$i \in [n]$, let $T_i$ denote the number of rounds needed to get a coupon of a type not yet in our
possession given that we have already $n - i$ different types. In other words, $T_i$ is the time we
need to go from “$i$ types missing” down to “$i - 1$ types missing”. We observe that $T_i$ has the
same distribution as $X'_i$ and that $T = \sum_{i=1}^n T_i$. Consequently, $T$ and $X'$ are equally distributed.

The advantage of this reformulation is that it allows us a different view on $X' \sim T$: The
probability that after $t$ rounds of the coupon collector process we do not have a fixed type
is exactly $(1 - C/n)^t$. Using a union bound, we see that the probability that after $t$ rounds
some coupon is missing is at most $n(1 - C/n)^t$. For $t = (1 + \delta)(1/C) n \ln(n)$, this is at most
$n(1 - C/n)^t \le n \exp(-Ct/n) = n \exp(-(1 + \delta) \ln(n)) = n^{-\delta}$. □
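The coupon collector process with idle rounds from the proof is easy to simulate. The sketch below is illustrative only (the function names are not from the paper): it simulates the process, computes the union bound $n(1 - C/n)^t$, and evaluates the threshold $t = (1 + \delta)(1/C) n \ln(n)$ used in the tail bound.

```python
import math
import random

def lazy_coupon_collector(n, C, rng):
    """Rounds until all n coupon types are collected when each round yields
    a uniformly random coupon with probability C and nothing otherwise."""
    missing, rounds = n, 0
    while missing > 0:
        rounds += 1
        # A not-yet-owned type arrives with probability C * missing / n.
        if rng.random() < C * missing / n:
            missing -= 1
    return rounds

def union_bound(n, C, t):
    """n (1 - C/n)^t, the union bound on Pr[some type is missing after t rounds]."""
    return n * (1 - C / n) ** t

def tail_threshold(n, C, delta):
    """The time bound (1 + delta) (1/C) n ln(n) from the lemma."""
    return (1 + delta) * (1 / C) * n * math.log(n)

n, C, delta = 50, 0.5, 1.0
t = tail_threshold(n, C, delta)
print(union_bound(n, C, t), n ** (-delta))       # first value is at most the second
print(lazy_coupon_collector(n, C, random.Random(0)), (n / C) * (math.log(n) + 1))
```

The first printed pair shows the union bound sitting below $n^{-\delta}$; the second compares one simulated run with the bound $(1/C) n (\ln(n) + 1)$ on the expectation.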

Note that in the proof of Lemma 6, once we have defined the coupon collector process (but
not before), we could have also used multiplicative drift. This would, however, have given
neither a better bound nor a shorter proof.

4.4 Additive Drift 

Drift analysis comprises a couple of methods to derive from information about the expected 
progress (e.g., in terms of the fitness distance) a result about the time needed to achieve a 
goal (e.g., finding an optimal solution). We shall several times use the following additive drift 
theorem from [HY01] (see also Theorem 2.7 in [OY11]). 

Theorem 7 (additive drift theorem). Let $X_0, X_1, \ldots$ be a Markov process over a finite state
space $S$. Let $g : S \to \mathbb{R}_{\ge 0}$. Let $T := \min\{t \ge 0 \mid g(X_t) = 0\}$. Let $\delta > 0$.

(i) If for all $t$, we have $E[g(X_t) - g(X_{t+1}) \mid g(X_t) > 0] \ge \delta$, then $E[T \mid X_0] \le g(X_0)/\delta$.

(ii) If for all $t$, we have $E[g(X_t) - g(X_{t+1}) \mid g(X_t) > 0] \le \delta$, then $E[T \mid X_0] \ge g(X_0)/\delta$.
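As a quick illustration of how the theorem is applied, consider the following toy chain, which is a hypothetical example and not part of the analysis in this paper. Since the drift is known exactly, both parts of the theorem apply and determine the expected hitting time.

```latex
% Hypothetical example chain: state space S = {0, 1, ..., m}, potential g(s) = s.
% From each state s > 0 the chain moves to s - 1 with probability q and
% stays at s otherwise; state 0 is absorbing.
\[
  E\bigl[g(X_t) - g(X_{t+1}) \mid g(X_t) > 0\bigr] = q \cdot 1 + (1 - q) \cdot 0 = q .
\]
% Both conditions (i) and (ii) hold with \delta = q, hence
\[
  \frac{g(X_0)}{q} \le E[T \mid X_0] \le \frac{g(X_0)}{q},
  \qquad\text{that is,}\qquad
  E[T \mid X_0] = \frac{X_0}{q},
\]
% which agrees with summing the expectations 1/q of X_0 independent
% geometric waiting times, one for each unit decrease.
```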

5 Proof of the Upper Bound 

In this section, we prove the upper bound statement of Theorem 2, that is, that the
$(1 + (\lambda, \lambda))$ GA with standard parameter settings optimizes every OneMax function using a number
of $O\left(\max\left\{\frac{n \log(n)}{\lambda}, \frac{n \lambda \log\log(\lambda)}{\log(\lambda)}\right\}\right)$ fitness evaluations both in expectation and with probability
$1 - n^{-c}$, where $c$ is an arbitrary positive constant.

The proof of the previous upper bound (Theorem 1) was based on the fitness level method
(first used in [Weg01] in the proof of Theorem 1, more explicit in [Weg02], see also [OY11]). In
its classic version, this method pessimistically estimates the runtime via the sum of the times
needed to leave each fitness level. It thus does not profit from the fact that a typical run of the
algorithm might not visit every fitness level. By a more careful analysis of the mutation phase
(Lemma 8) and the crossover phase (Lemma 9), we shall show that this indeed happens. For
all values of $\lambda$, we obtain that when starting an iteration with a search point $x$ having fitness
distance $d(x) := n - \mathrm{Om}(x)$ at least $n \log\log(\lambda)/\log(\lambda)$, then the average fitness improvement
is $\Omega(\log(\lambda)/\log\log(\lambda))$. Consequently, additive drift analysis (Theorem 7) tells us that only
$O(n \log\log(\lambda)/\log(\lambda))$ iterations are needed to find a search point with fitness distance at most
$n \log\log(\lambda)/\log(\lambda)$. Note that the fitness range from the typical initial fitness distance of $n/2$ to
a fitness distance of $n \log\log(\lambda)/\log(\lambda)$ contains $\Theta(n)$ fitness levels. Hence the previous analysis
would have given only a bound of $O(n)$ rounds.

There is an intuitive explanation for these numbers based on the balls-into-bins paradigm.
When our current search point $x$ is in distance $n/D$ from the optimum, then already $x^{(1)}$ has
an expected number of at least $\lambda/D$ “good bits”, i.e., bit positions that are zero in $x$ and one in
$x^{(1)}$. The same is true for $x'$. Each of these good bits is copied into each of the $y^{(i)}$ generated in the
crossover phase with probability $1/\lambda$. The total number of copies of good bits in $y^{(1)}, \ldots, y^{(\lambda)}$
thus is around $\lambda/D$ again. Since they are uniformly spread over the $y^{(i)}$, we are in a situation
closely resembling the balls-into-bins scenario, in which $\lambda/D$ balls are uniformly thrown into $\lambda$
bins. By a result of Raab and Steger [RS98], we know that when $D$ is at most polylogarithmic
in $\lambda$, then the most-loaded bin will contain $\Theta(\log(\lambda)/\log\log(\lambda))$ balls. For our setting, this
means that we expect one of the $y^{(i)}$ to inherit $\Omega(\log(\lambda)/\log\log(\lambda))$ good bits. Unfortunately,
since we do not distribute the good bits completely independently, we cannot transform this
intuitive argument into a rigorous proof, but need to argue differently.
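The balls-into-bins intuition can be explored empirically. The following Python sketch is illustrative only (names and parameters are arbitrary choices): it throws balls independently and uniformly into bins, which is exactly the idealized independence that the actual crossover phase does not fully provide, and reports the load of the fullest bin.

```python
import random
from collections import Counter

def max_load(balls, bins, rng):
    """Throw `balls` balls independently and uniformly into `bins` bins
    and return the number of balls in the fullest bin."""
    counts = Counter(rng.randrange(bins) for _ in range(balls))
    return max(counts.values(), default=0)

def average_max_load(balls, bins, runs=200, seed=0):
    """Average maximum load over several independent repetitions."""
    rng = random.Random(seed)
    return sum(max_load(balls, bins, rng) for _ in range(runs)) / runs

# Example: lambda = 1024 bins and lambda / D balls for D = 16.
print(average_max_load(1024 // 16, 1024))
```

Even though the average load per bin is only $1/D$ in this example, the fullest bin typically holds more than one ball, which mirrors the observation that the best crossover offspring gains more good bits than an average one.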

We start by analyzing the mutation phase. Since we aim at understanding those iterations
where we gain more than a constant number of fitness levels, we restrict ourselves to the case
that $\lambda = \omega(1)$, which eases the calculations.

Lemma 8. Let $\varepsilon > 0$. Assume that $\lambda = \omega(1)$. Let $x \in \{0,1\}^n$. Let $D$ be such that $d := d(x) =
n/D$. Assume that $D = o(\lambda)$. Consider one run of the mutation phase of Algorithm 1. As in
the description of the algorithm, denote by $\ell$ the actual mutation strength and by $x'$ the winner
individual. Let $B' := \{i \in [n] \mid x_i = 0 \wedge x'_i = 1\}$ be the set of 1-bits that $x'$ has gained over $x$.

Then with probability $1 - o(1)$, we have both $|\ell - \lambda| \le \varepsilon\lambda$ and $|B'| \ge (1 - \varepsilon)\lambda/D$.

The statement follows easily from Theorems 3 and 5.

Proof. Since $\lambda = \omega(1)$ and $\ell$ follows a binomial distribution with parameters $n$ and $\lambda/n$, a
simple application of the Chernoff bound (Theorem 3 (b) and (c)) implies that with probability
$1 - o(1)$ we have $|\ell - \lambda| \le (\varepsilon/2)\lambda$. Conditional on that, we analyze how the first offspring
$x^{(1)}$ is generated. Let $B_1$ be the set of bit positions that are zero in $x$ and one in $x^{(1)}$, that
is, $B_1 := \{i \in [n] \mid x_i = 0 \wedge x^{(1)}_i = 1\}$ (“good bits”). Then $E[|B_1|] = d(\ell/n) = \ell/D$. Since
$D = o(\lambda)$ and $\ell = \Theta(\lambda)$, this expectation is $\omega(1)$ and a Chernoff bound for the hypergeometric
distribution (Theorems 5 and 3 (c)) shows that we have $\Pr[|B_1| \ge (1 - (\varepsilon/2))\ell/D] = 1 - o(1)$.
Since all $x^{(j)}$, $j \in [\lambda]$, have the same Hamming distance from $x$, the fittest individual $x'$ is also
the one with the largest number of good bits. Hence $|B'| \ge |B_1| \ge (1 - (\varepsilon/2))\ell/D \ge (1 - \varepsilon)\lambda/D$
with probability $1 - o(1)$. □

We next analyze a run of the crossover phase. While in the previous lemma we only exploited
that an individual generated in the mutation phase has roughly as many good bits as expected,
we shall now exploit that the best of the $\lambda$ individuals generated in the crossover phase is much
better than the average one.


Lemma 9. Let $x, x' \in \{0,1\}^n$ such that their Hamming distance $\ell := H(x, x')$ satisfies
$\ell \le 2\lambda - 2$. Let $D'$ be such that $B' := \{i \in [n] \mid x_i = 0 \wedge x'_i = 1\}$ satisfies $|B'| \ge \lambda/D'$. Consider
a run of the crossover phase starting with these variable values and computing an offspring
$y \in \{0,1\}^n$.

Then with probability at least $1 - 1/e$, we have
$\mathrm{Om}(y) - \mathrm{Om}(x) \ge \lfloor \min\{(\tfrac{1}{2}\ln(\lambda) - 1)/(\ln\ln(\lambda) + \ln(D')), \lambda/D'\} \rfloor$.

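Proof. Let $\gamma \le \lambda/D'$ be a positive integer. Consider the outcome $y^{(j)}$ of a single crossover
operation for some $j \in [\lambda]$ in Algorithm 1. Let $A_j$ be the event that $\mathrm{Om}(y^{(j)}) \ge \mathrm{Om}(x) + \gamma$.
This event in particular occurs when the crossover operation selects $\gamma$ “good bits” (those with
index in $B'$) from $x'$ and none of the “bad bits” (those in which $x$ and $x'$ differ, but that are
not in $B'$). Consequently,

\begin{align*}
\Pr[A_j] &\ge \binom{|B'|}{\gamma} (1/\lambda)^{\gamma} (1 - 1/\lambda)^{\ell - \gamma} \tag{1} \\
&\ge (|B'|/\gamma)^{\gamma} (1/\lambda)^{\gamma} (1 - 1/\lambda)^{2(\lambda - 1)} \\
&\ge (\lambda/(D'\gamma))^{\gamma} (1/\lambda)^{\gamma} (1/e^2) \\
&= \exp(-2 - \gamma \ln \gamma - \gamma \ln D').
\end{align*}

For $\gamma = \lfloor \min\{(\tfrac{1}{2}\ln(\lambda) - 1)/(\ln\ln(\lambda) + \ln(D')), \lambda/D'\} \rfloor$ we have $\Pr[A_j] \ge 1/\lambda$. Consequently,
the probability that at least one of the $A_j$ holds is at least $1 - (1 - 1/\lambda)^{\lambda} \ge 1 - 1/e$. □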

We note that the argument up to (1) is very similar to the reasoning in the proof of Theorem 5
in [JDW05], where it is shown that the $(1 + \lambda)$ EA optimizing OneMax performs super-constant
improvements in the early part of the optimization process. The choice of our $\gamma$, however, is
different due to the different relation of $\lambda$ and $n$. Interestingly, in the analysis of the mutation
phase, such arguments do not seem to give significant additional improvement (recall that there
we only used the expected gain from a fixed single offspring).

Above, we showed that in the early part of the optimization process, we regularly gain
more than one fitness level in one iteration. For the remainder, we re-use the fitness level type
argument of [DDE15], which is summarized in the following lemma (Lemma 7 of [DDE13] in
the special case that $k = \lambda$).


Lemma 10. Assume $\lambda \ge 2$. In the notation of the $(1 + (\lambda, \lambda))$ GA, the probability that one
iteration produces a search point $y$ that is strictly better than the parent $x$ is at least

\[ p_{d(x)} := C \left(1 - \left(\frac{n - d(x)}{n}\right)^{\lambda^2/2}\right), \]

where $C$ is an absolute constant.


We are now ready to prove the main result of this section. 

Proof of the upper bound in Theorem 2. We regard the different regimes of the optimization
process separately, since they need very different arguments. If $\lambda = \omega(1)$, then let $d_0 :=
n \ln\ln(\lambda)/\ln(\lambda)$, else let $d_0 = n$ (and there is no first phase).

First phase: From the random starting point to a solution $x$ with $d(x) \le d_0 :=
n \ln\ln(\lambda)/\ln(\lambda)$ in $O(n \log\log(\lambda)/\log(\lambda))$ iterations. Let $D = \ln(\lambda)/\ln\ln(\lambda)$. Let $x \in \{0,1\}^n$
be any search point with $d(x) \ge d_0 = n/D$. By Lemmas 8 and 9, we see that with probability
$p = 1 - (1/e) - o(1)$, one iteration of the main loop of Algorithm 1 produces a solution $y$ with
$\mathrm{Om}(y) \ge \mathrm{Om}(x) + \Delta$, where $\Delta$ is some number satisfying $\Delta = \Omega(\log(\lambda)/\log\log(\lambda))$. This seems
to call for an application of additive drift (Theorem 7), but in particular for the derivation of
the large deviation claim, the following hand-made solution seems to be easier (despite several
tail bounds for additive drift existing, see, e.g., [Kot14] and the references therein).

For $t = 1, 2, \ldots$ let us define the following binary random variable $X_t$. If at the start of
iteration $t$ we have $d(x) > d_0$, then $X_t = 1$ if and only if $\mathrm{Om}(y) \ge \mathrm{Om}(x) + \Delta$. If $d(x) \le d_0$,
let $X_t = 1$ with probability $p$ independent from all other random decisions. Let $T_0$ denote the
number of iterations until a search point $x$ with $d(x) \le d_0$ is found. For all $T > 0$, we
observe that $X_T := \sum_{t=1}^{T} X_t \ge n/\Delta$ implies that $T_0 \le T$, that is, our $(1 + (\lambda, \lambda))$ GA needed at
most $T$ iterations to find a search point $x$ with $d(x) \le d_0$. We have $E[X_T] = Tp$ and
$\Pr[X_T < (1/2) E[X_T]] \le \exp(-E[X_T]/8)$. In particular, for $T = 2n/(\Delta p)$, we have $E[X_T] = 2n/\Delta$ and
$\Pr[X_T < n/\Delta] \le \exp(-E[X_T]/8) = \exp(-n/(4\Delta)) = \exp(-n^{1 - o(1)})$.

Second phase: From a solution with $d$-value at most $d_0$ to one with $d$-value at
most $d_1 := \lfloor n/(2\lambda^2) \rfloor$ in $O(n \log\log(\lambda)/\log(\lambda))$ iterations. Once we have a solution
of fitness distance at most $d_0$, we use the fitness level argument analogous to the proof of
Theorem 1. We reformulate the proof slightly to allow proving a large deviation bound for the
optimization time. By Lemma 10, the remaining number of iterations is dominated by a sum of
geometric random variables $X_{d_0}, \ldots, X_1$, where $\Pr[X_d = m] = (1 - p_d)^{m-1} p_d$ for all
$m = 1, 2, \ldots$ and $p_d = C(1 - (1 - d/n)^{\lambda^2/2})$ is as in Lemma 10.

Note that for $d > d_1$, $p_d \ge C'$ for some absolute constant $C'$. Hence the expected number
$T_1$ of iterations to reduce the fitness distance to $d_1$ is at most
$E[T_1] \le E[X_{d_0} + \cdots + X_{d_1+1}] \le (1/C')(d_0 - d_1) \le (1/C')d_0 = O(n \log\log(\lambda)/\log(\lambda))$
by linearity of expectation. Since each iteration with $d(x) > d_1$, independent of what happened
in the previous iterations, has a success chance of at least $C'$, we observe that the probability
to have fewer than $d_0 - d_1$ successes in $2(1/C')(d_0 - d_1)$ iterations is at most
$\exp(-(d_0 - d_1)/(4C')) = \exp(-\Theta(d_0)) = \exp(-n^{1-o(1)})$.
Note that to apply the (multiplicative) Chernoff bound, here we used the "moderate
independence" argument of Lemma 1.18 of [Doe11].

Third phase: From a solution with fitness distance at most $d_1$ to an optimal
solution in $O(n\log(n)/\lambda^2)$ iterations. We continue to use the fitness level method as in the
previous section of the proof, but note that for $d \le d_1$, we have
$p_d = C(1 - (1 - d/n)^{\lambda^2/2}) \ge C(1 - \exp(-d\lambda^2/(2n))) \ge Cd\lambda^2/(4n)$,
where we used the estimate $e^{-x} \le 1 - x/2$ valid for all $x \in [0, 1]$. We thus see that the
remaining time $T_2$ to get to the optimal solution is dominated by $X = X_{d_1} + \cdots + X_1$,
which is a sum of independent geometric random variables with harmonic expectations. Hence by
Lemma 6, we have $E[T_2] \le E[X] \le \frac{4n}{C\lambda^2}(1 + \ln d_1) = O(n\log(n)/\lambda^2)$ and

\[ \Pr\left[T_2 \ge (1 + \delta)\,\frac{4n\ln(d_1)}{C\lambda^2}\right] \le d_1^{-\delta} \quad \text{for any } \delta > 0. \]
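The harmonic-sum step of the third phase can be checked numerically. The following Python sketch sums the expected waiting times $1/p_d$ for $d = 1, \ldots, d_1$ with $p_d = C(1 - (1 - d/n)^{\lambda^2/2})$ and compares the result with $\frac{4n}{C\lambda^2}(1 + \ln d_1)$, which follows from $p_d \ge Cd\lambda^2/(4n)$ and the standard bound on harmonic numbers. The value $C = 1$ and the function names are assumptions made for illustration only; the proof does not specify the absolute constant.

```python
import math


def p_d(d, n, lam, C=1.0):
    """Success probability bound from Lemma 10; C = 1 is a placeholder value."""
    return C * (1.0 - (1.0 - d / n) ** (lam * lam / 2.0))


def third_phase_bounds(n, lam, C=1.0):
    """Return d_1, the sum of 1/p_d over d = 1..d_1, and the harmonic bound."""
    d1 = n // (2 * lam * lam)  # d_1 = floor(n / (2 lambda^2)), assumed >= 1
    waiting = sum(1.0 / p_d(d, n, lam, C) for d in range(1, d1 + 1))
    harmonic = 4.0 * n / (C * lam * lam) * (1.0 + math.log(d1))
    return d1, waiting, harmonic


d1, waiting, harmonic = third_phase_bounds(10_000, 5)
print(d1, round(waiting), round(harmonic))
```

For these illustrative parameters the summed waiting times stay below the harmonic bound, as the estimate predicts.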

In total, we see that the number $T = T_0 + T_1 + T_2$ of iterations until the optimum
is found has an expectation of at most $E[T] = E[T_0] + E[T_1] + E[T_2] =
O(\max\{n\log(n)/\lambda^2, n\log\log(\lambda)/\log(\lambda)\})$ and the probability that this upper bound is
exceeded by a constant factor of $(1 + \delta)$ is only $O((n/\lambda^2)^{-\delta})$.

Since in each iteration the fitness of $O(\lambda)$ search points is computed, we proved the claimed
upper bound of $O(\max\{n\log(n)/\lambda, n\lambda\log\log(\lambda)/\log(\lambda)\})$ for the expected optimization
time, and again, exceeding this expectation by a factor of $1 + \delta$ has a probability of only
$O((n/\lambda^2)^{-\delta})$.

□ 
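To get a feeling for the quantities bounded in this proof, the algorithm can be run directly. The following Python sketch implements the $(1 + (\lambda, \lambda))$ GA on OneMax as it is described in the text: one value $\ell \sim \mathrm{Bin}(n, p)$ per iteration, $\lambda$ mutation offspring that each flip exactly $\ell$ bits, $\lambda$ crossover offspring that take each bit from the mutation winner with probability $c$, and $2\lambda$ fitness evaluations per iteration. The function names, the tie-breaking by first occurrence, and the acceptance of offspring of equal fitness are assumptions of this sketch, not taken from Algorithm 1.

```python
import random


def one_max(x):
    """Number of one-bits; the optimum is the all-ones string."""
    return sum(x)


def binomial(n, p, rng):
    """Sample from Bin(n, p) by summing n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def run_ga(n, lam, rng):
    """Run the GA with p = lam/n and c = 1/lam; return the fitness evaluations used."""
    p, c = lam / n, 1.0 / lam
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = one_max(x)
    evaluations = 1  # the initial search point
    while fx < n:
        # Mutation phase: one ell for all offspring, each flips exactly ell bits.
        ell = binomial(n, p, rng)
        best_mut, best_mut_f = None, -1
        for _ in range(lam):
            child = x[:]
            for i in rng.sample(range(n), ell):
                child[i] = 1 - child[i]
            f = one_max(child)
            if f > best_mut_f:
                best_mut, best_mut_f = child, f
        # Crossover phase: take each bit from the mutation winner with
        # probability c and from the parent otherwise.
        best_y, best_y_f = None, -1
        for _ in range(lam):
            y = [best_mut[i] if rng.random() < c else x[i] for i in range(n)]
            f = one_max(y)
            if f > best_y_f:
                best_y, best_y_f = y, f
        evaluations += 2 * lam
        # Selection: keep the crossover winner if it is not worse (assumption).
        if best_y_f >= fx:
            x, fx = best_y, best_y_f
    return evaluations


print(run_ga(100, 4, random.Random(1)))
```

Averaging `run_ga` over many seeds for several values of `lam` gives an empirical view of how the number of evaluations depends on the population size. Such experiments illustrate the bound but do not replace the proof.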


6 A Matching Lower Bound 

In this section, we prove the first lower bound for the runtime of the $(1 + (\lambda, \lambda))$ GA. It matches
the new upper bound proven in the previous section, so the two bounds determine the asymptotic
runtime of the $(1 + (\lambda, \lambda))$ GA for all $\lambda \in [1..n]$. This sharp runtime result immediately gives
the optimal value for the population size $\lambda$.
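How the optimal population size follows from the sharp bound can be illustrated numerically. The Python sketch below minimizes the expression $\max\{n\ln(n)/\lambda, n\lambda\ln\ln(\lambda)/\ln(\lambda)\}$ over integer $\lambda$ and compares the minimizer with $\sqrt{\ln(n)\ln\ln(n)/\ln\ln\ln(n)}$, the order of magnitude stated in the abstract. All hidden constants are set to 1, natural logarithms are used throughout, and the search starts at $\lambda = 3$ so that $\ln\ln(\lambda)$ is positive. These are assumptions of the sketch, so only the rough agreement is meaningful.

```python
import math


def bound(n, lam):
    """max{n ln(n)/lam, n lam lnln(lam)/ln(lam)} with all constants set to 1."""
    first = n * math.log(n) / lam
    second = n * lam * math.log(math.log(lam)) / math.log(lam)
    return max(first, second)


def best_lambda(n):
    """Integer lam in [3, n] that minimizes the constant-free bound."""
    return min(range(3, n + 1), key=lambda lam: bound(n, lam))


def asymptotic_lambda(n):
    """sqrt(ln(n) lnln(n) / lnlnln(n)); requires n > e^e for positive logs."""
    l1 = math.log(n)
    l2 = math.log(l1)
    l3 = math.log(l2)
    return math.sqrt(l1 * l2 / l3)


n = 10_000
print(best_lambda(n), asymptotic_lambda(n))
```

The first term decreases and the second increases in $\lambda$, so the minimum lies where the two terms balance, which is what the closed-form expression captures asymptotically.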

Theorem 11. For all $\lambda \le n$ the $(1 + (\lambda, \lambda))$ GA with the standard parameter setting $p = \lambda/n$
and $c = 1/\lambda$ needs an expected number of

\[ \Omega\left(\max\left\{\frac{n\log(n)}{\lambda}, \frac{n\lambda\log\log(\lambda)}{\log(\lambda)}\right\}\right) \]

fitness evaluations to find the optimum of any OneMax function.

To prove the theorem, we show that the expected optimization time is both $\Omega(n\log(n)/\lambda)$ and
$\Omega(n\lambda\log\log(\lambda)/\log(\lambda))$ (at least for sufficient ranges of $\lambda$). We do this separately in the
following two subsections. The proof of Theorem 11 then is an immediate consequence of
Lemmas 12 and 13.

We remark without proof that also the lower tail of the runtime distribution admits tail
bounds. Since such bounds are less relevant for the use of algorithms, we do not give further
details.


6.1 First Lower Bound 


To prove that $\Omega(n\log(n)/\lambda)$ is a lower bound, we use the standard argument that in order to
find the optimum, each bit that was not initially set to one has to be touched at least
once by a mutation operator. This argument has been used, e.g., in the classic proof for the
lower bound of the runtime of the (1 + 1) EA in [DJW02]. We have to be slightly more careful
though, because our random experiment has two types of dependencies: (i) Each individual in
the mutation phase is not generated by standard bit mutation (flipping each bit independently),
but by flipping a fixed number $\ell$ of bits. (ii) This $\ell$ is chosen randomly, but all $\lambda$ individuals
generated in one mutation phase are generated using the same value of $\ell$.

Lemma 12. Let $\lambda$ be an integral function of $n$ with $1 \le \lambda \le n/4$. The probability that the
$(1 + (\lambda, \lambda))$ GA with standard parameters $p = \lambda/n$ and $c = 1/\lambda$ has found the optimum within
$T = \lfloor n\log(n)/(8\lambda^2) \rfloor$ iterations (equivalent to $2\lambda T$ fitness evaluations) is
$\exp(-\Omega(\min\{\lambda T, \sqrt{n}\}))$. In particular, the expected optimization time is $\Omega(n\log(n)/\lambda)$.

Proof. Using the Chernoff bound of Theorem 3 (c), we see that with probability $1 - \exp(-\Omega(n))$,
the initial search point has at least $n/3$ bits valued zero ("missing bits"). Let us consider what
happens in the first $T = \lfloor n\log(n)/(8\lambda^2) \rfloor$ iterations. Denote by $\ell_1, \ldots, \ell_T$ the values of
$\ell$ chosen by the algorithm in these iterations. Again, with probability $1 - \exp(-\Omega(n))$, all
$\ell_i$ are at most $n/2$ (Chernoff bound of Theorem 3 (a) and union bound; note that a binomially
distributed random variable is a sum of independent 0,1 random variables). Using these
arguments a second time as well as the fact that the $\ell_i$ are independent, we obtain that with
probability $1 - \exp(-\Omega(\lambda T))$, we have $\sum_{i=1}^{T} \ell_i \le 2\lambda T$. Conditional on none of these
exceptional events occurring, the probability that a particular one of the missing bits is never
flipped in the mutation phases of the first $T$ iterations is

\[ \prod_{i=1}^{T} (1 - \ell_i/n)^{\lambda} \ge \prod_{i=1}^{T} \exp(-2\ell_i/n)^{\lambda}
   = \exp\left(-\frac{2\lambda}{n} \sum_{i=1}^{T} \ell_i\right)
   \ge \exp(-4\lambda^2 T/n) \ge n^{-1/2}, \]

where we have used in the first step that $1 - c \ge e^{-2c}$ for $0 \le c \le 1/2$.
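The chain of estimates above can be traced with concrete numbers. The Python sketch below evaluates the four quantities from left to right for one illustrative choice, namely $n = 10^4$, $\lambda = 5$, natural logarithm in the definition of $T$, and every $\ell_i$ equal to its expectation $\lambda$. These parameter values and the function names are assumptions of the sketch.

```python
import math


def never_flipped_probability(ells, n, lam):
    """prod_i (1 - ell_i/n)^lam: a fixed bit survives all lam mutations per round."""
    prob = 1.0
    for ell in ells:
        prob *= (1.0 - ell / n) ** lam
    return prob


def chain(ells, n, lam):
    """Return the four quantities of the estimate, from left to right."""
    T = len(ells)
    exact = never_flipped_probability(ells, n, lam)
    via_exp = math.exp(-2.0 * lam * sum(ells) / n)  # uses 1 - c >= exp(-2c)
    via_sum = math.exp(-4.0 * lam * lam * T / n)    # uses sum(ells) <= 2*lam*T
    return exact, via_exp, via_sum, n ** -0.5


n, lam = 10_000, 5
T = math.floor(n * math.log(n) / (8 * lam * lam))
ells = [lam] * T  # illustrative: each ell_i equals its expectation lam
print(T, chain(ells, n, lam))
```

For this choice the four values are non-increasing from left to right, and the last inequality is nearly tight because $T$ is chosen as large as the bound $4\lambda^2 T/n \le \ln(n)/2$ permits.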

The events that a bit was never flipped in a certain time interval are not independent, since
also in a single application of the mutation operator the bits are not treated independently.
However, since we always flip a fixed number of bits (not necessarily the same in each iteration),
these events are negatively correlated. We omit this proof, because it is technical and lengthy,
but neither difficult nor insightful. From this, we conclude that the probability that there
is a missing bit that was never flipped up to iteration $T$ is at least
$1 - (1 - n^{-1/2})^{n/3} \ge 1 - \exp(-n^{-1/2})^{n/3} = 1 - \exp(-n^{1/2}/3)$, this time using the
estimate $1 + x \le e^x$ valid for all $x \in \mathbb{R}$.

Consequently, with probability at least $1 - \exp(-\Omega(n)) - \exp(-\Omega(\lambda T)) - \exp(-n^{1/2}/3)$, the
$(1 + (\lambda, \lambda))$ GA needs more than $T$ iterations to find the optimum. This immediately implies
the claimed bound on the expected optimization time. □

6.2 Second Lower Bound 

We prove the second lower bound via drift analysis. We show that the expected fitness increase
in each round is $O(\log(\lambda)/\log\log(\lambda))$. Then the additive drift theorem yields the lower bound
on the expected optimization time. While our arguments are valid also for constant $\lambda$, in
particular the asymptotic notation becomes easier when assuming $\lambda = \omega(1)$. We can make this
assumption freely, since for constant $\lambda$ the first lower bound is stronger anyway.

Lemma 13. Let $\lambda = \omega(1)$. Then the expected optimization time of the $(1 + (\lambda, \lambda))$ GA with
parameters $p = \lambda/n$ and $c = 1/\lambda$ is $\Omega(n\lambda\log\log(\lambda)/\log(\lambda))$.

Proof. Let $x$ be any search point different from the optimum. We first analyze what happens
in a typical iteration of the main loop of the $(1 + (\lambda, \lambda))$ GA and later treat the exceptional
cases. Let $\ell$, $x'$, and $y^{(1)}, \ldots, y^{(\lambda)}$ be as in the pseudo-code of Algorithm 1.

With probability $1 - \exp(-\Omega(\lambda))$, we have $\ell \le 2\lambda$ (Chernoff bound of Theorem 3 (b)). Hence,
trivially, $B := \{i \in [n] \mid x_i = 0 \wedge x'_i = 1\}$ has cardinality at most $2\lambda$. Let
$Y_j := \{i \in B \mid y^{(j)}_i = 1\}$ be the set of new 1-bits that made it into $y^{(j)}$. We aim at
showing that $|Y_j|$ is not very large. We have $E[|Y_j|] = |B|/\lambda \le 2$. There is nothing to show if
$E[|Y_j|] = 0$, hence let us assume that $E[|Y_j|] > 0$. Let $r := (2e\ln(\lambda)/\ln\ln(\lambda))/E[|Y_j|]$ and
note that $r \ge e\ln(\lambda)/\ln\ln(\lambda) =: r'$. Then the strong version of the Chernoff bound
(Theorem 3 (a)) yields

\[ \Pr[|Y_j| \ge 2e\ln(\lambda)/\ln\ln(\lambda)] = \Pr[|Y_j| \ge r E[|Y_j|]]
   \le \left(\frac{e^{r-1}}{r^r}\right)^{E[|Y_j|]} \le (e/r)^{r E[|Y_j|]} = (e/r)^{2r'} \le (e/r')^{2r'} \]
\[ = \exp(\ln((e/r')^{2r'})) = \exp(-2e\ln(\lambda)(1 - o(1))) = \lambda^{-2e+o(1)}. \]

Hence, with probability at least $1 - \lambda^{-2e+1+o(1)}$, none of the $|Y_j|$ is larger than $2r'$, which
implies that $\textsc{Om}(y) \le \textsc{Om}(x) + 2r'$.
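The claim that no crossover offspring inherits many new 1-bits can be observed in a small simulation. In the Python sketch below, each of the $|B|$ positions enters an offspring independently with probability $c = 1/\lambda$, so $|Y_j|$ is binomially distributed. The largest $|Y_j|$ among $\lambda$ offspring is then compared with $2r' = 2e\ln(\lambda)/\ln\ln(\lambda)$. The choice $|B| = 2\lambda$ (the largest value allowed in the typical case), the value $\lambda = 256$, and the function names are assumptions made for illustration.

```python
import math
import random


def max_new_one_bits(lam, size_b, rng):
    """Largest |Y_j| over lam independent crossover offspring.

    Every position in B is copied from x' with probability c = 1/lam,
    so each |Y_j| follows Bin(size_b, 1/lam).
    """
    c = 1.0 / lam
    return max(
        sum(1 for _ in range(size_b) if rng.random() < c)
        for _ in range(lam)
    )


def threshold(lam):
    """2r' = 2e ln(lam) / lnln(lam), the cap used in the proof (needs lam >= 3)."""
    return 2.0 * math.e * math.log(lam) / math.log(math.log(lam))


lam = 256
rng = random.Random(0)
print(max_new_one_bits(lam, 2 * lam, rng), threshold(lam))
```

In such runs the observed maximum typically stays well below the threshold, in line with the failure probability $\lambda^{-2e+1+o(1)}$.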

It remains to treat the exceptional cases. If $\ell > 2\lambda$, which happens with probability at most
$\exp(-\Omega(\lambda))$, the expected value of $\ell$ is still only $E[\ell \mid \ell > 2\lambda] \le 3\lambda + 1$, and thus
$E[\textsc{Om}(y)] \le \textsc{Om}(x) + 3\lambda + 1$. The second exceptional case occurs when $\ell \le 2\lambda$, but (with a
probability of at most $\lambda^{-2e+1+o(1)}$) we have $\textsc{Om}(y) > \textsc{Om}(x) + 2r'$. In this case, however, we
still have $\textsc{Om}(y) \le \textsc{Om}(x) + \ell \le \textsc{Om}(x) + 2\lambda$.

From all this, we see that

\[ E[\textsc{Om}(y) - \textsc{Om}(x)]
   \le (1 - \exp(-\Omega(\lambda))) \left( (1 - \lambda^{-2e+1+o(1)}) \, \frac{2e\ln(\lambda)}{\ln\ln(\lambda)} + \lambda^{-2e+1+o(1)} \, 2\lambda \right)
   + \exp(-\Omega(\lambda))(3\lambda + 1) \]
\[ \le O(\log(\lambda)/\log\log(\lambda)). \]

We now apply the classic additive drift theorem (Theorem 7) and obtain an expected number of
rounds of $\Omega(n\log\log(\lambda)/\log(\lambda))$, equivalent to an optimization time of
$\Omega(n\lambda\log\log(\lambda)/\log(\lambda))$.

□ 


7 Conclusion 

We conducted a tight runtime analysis for the $(1 + (\lambda, \lambda))$ GA on OneMax, giving an improved
upper bound for the expected optimization time, the first lower bound (which matches the upper
bound for all values of $\lambda$), and a tail bound for the upper tail of the runtime distribution. This
analysis both shows that the $(1 + (\lambda, \lambda))$ GA is faster than what could be shown in [DDE15],
and it gives the asymptotically optimal value for the offspring population size $\lambda$, again leading
to a super-constant factor speed-up over the runtime stemming from the value giving the best
bound in the previous work.

Our sharp bounds also give more insight into the working principles of this algorithm. In
particular, we observe that in its crossover phase, generating $\lambda$ offspring in parallel often
produces at least one offspring that is significantly better than the expected outcome of a
crossover application. This allows gaining several fitness levels in one iteration (including
regularly a super-constant number in the early time of the optimization process). This advantage
of larger offspring population sizes seems to have been rarely analyzed rigorously (with the
analyses of the $(1 + \lambda)$ EA in [JDW05, DK15] being the only exceptions known to us). The more
common use of larger offspring population sizes in the literature seems to be that an offspring
population size of $\lambda$ reduces the waiting time for a fitness level gain by approximately a factor
of $\lambda$ (given that this waiting time is large enough). This latter argument, naturally, does not
reduce the (total) optimization time (number of fitness evaluations), but only the parallel one
(number of generations).

With the proof methods developed in this work, which include a number of clever combinations
of drift and Chernoff bounds, we are optimistic that it is now possible to analyze the
$(1 + (\lambda, \lambda))$ GA also on more complicated optimization problems. The first experimental results
for several standard test functions [DDE15] and combinatorial optimization problems [GP14]
suggest that this is a fruitful direction of research.

Acknowledgments 

We thank an unknown reviewer for very detailed comments and for pointing us to the work of
Witt [Wit14].

References 

[DDE13] Benjamin Doerr, Carola Doerr, and Franziska Ebel. Lessons from the black-box:
Fast crossover-based genetic algorithms. In Proceedings of the Annual Genetic and
Evolutionary Computation Conference (GECCO'13), pages 781-788. ACM, 2013.

[DDE15] Benjamin Doerr, Carola Doerr, and Franziska Ebel. From black-box complexity to
designing new genetic algorithms. Theoretical Computer Science, 567:87-104, 2015.

[DHK12] B. Doerr, E. Happ, and C. Klein. Crossover can provably be useful in evolutionary
computation. Theoretical Computer Science, 425:17-33, 2012.


[DJK+13] B. Doerr, D. Johannsen, T. Kötzing, F. Neumann, and M. Theile. More effective
crossover operators for the all-pairs shortest path problem. Theoretical Computer
Science, 471:12-26, 2013.

[DJW02] Stefan Droste, Thomas Jansen, and Ingo Wegener. On the analysis of the (1+1)
evolutionary algorithm. Theoretical Computer Science, 276:51-81, 2002.

[DK15] Benjamin Doerr and Marvin Künnemann. Optimizing linear functions with the
(1+λ) evolutionary algorithm - different asymptotic runtimes for different instances.
Theoretical Computer Science, 561:3-23, 2015.

[Doe11] Benjamin Doerr. Analyzing randomized search heuristics: Tools from probability
theory. In Anne Auger and Benjamin Doerr, editors, Theory of Randomized Search
Heuristics, pages 1-20. World Scientific Publishing, 2011. Available at
http://www.worldscientific.com/doi/suppl/10.1142/7438/suppl_file/7438_chap01.pdf

[DT09] B. Doerr and M. Theile. Improved analysis methods for crossover-based algorithms.
In Proceedings of the Annual Genetic and Evolutionary Computation Conference
(GECCO'09), pages 247-254. ACM, 2009.

[FW04] Simon Fischer and Ingo Wegener. The Ising model on the ring: Mutation versus
recombination. In Proceedings of the Annual Genetic and Evolutionary Computation
Conference (GECCO'04), volume 3102 of Lecture Notes in Computer Science, pages
1113-1124. Springer, 2004.

[GP14] Brian W. Goldman and William F. Punch. Parameter-less population pyramid.
In Proceedings of the Annual Genetic and Evolutionary Computation Conference
(GECCO'14), pages 785-792. ACM, 2014.

[HY01] Jun He and Xin Yao. Drift analysis and average time complexity of evolutionary
algorithms. Artificial Intelligence, 127:57-85, 2001.

[Jan13] Thomas Jansen. Analyzing Evolutionary Algorithms - The Computer Science
Perspective. Springer, 2013.

[JDW05] Thomas Jansen, Kenneth A. De Jong, and Ingo Wegener. On the choice of the
offspring population size in evolutionary algorithms. Evolutionary Computation,
13:413-440, 2005.

[JW02] Thomas Jansen and Ingo Wegener. The analysis of evolutionary algorithms - a proof
that crossover really can help. Algorithmica, 34:47-66, 2002.

[Kot14] Timo Kötzing. Concentration of first hitting times under additive drift. In
Proceedings of the Annual Genetic and Evolutionary Computation Conference (GECCO'14),
pages 1391-1398. ACM, 2014.

[LW12] Per Kristian Lehre and Carsten Witt. Black-box search by unbiased variation.
Algorithmica, 64:623-642, 2012.

[MHF93] Melanie Mitchell, John H. Holland, and Stephanie Forrest. When will a genetic
algorithm outperform hill climbing? In Proceeding of the 7th Neural Information
Processing Systems Conference (NIPS), volume 6 of Advances in Neural Information
Processing Systems, pages 51-58. Morgan Kaufmann, 1993.

[OY11] Pietro Simone Oliveto and Xin Yao. Runtime analysis of evolutionary algorithms
for discrete optimization. In Anne Auger and Benjamin Doerr, editors, Theory of
Randomized Search Heuristics, pages 21-52. World Scientific Publishing, Singapore,
2011.

[RS98] Martin Raab and Angelika Steger. "Balls into bins" - a simple and tight analysis.
In Proceedings Randomization and Approximation Techniques in Computer Science
(RANDOM'98), volume 1518 of Lecture Notes in Computer Science, pages 159-170.
Springer, 1998.

[Sud05] Dirk Sudholt. Crossover is provably essential for the Ising model on trees. In
Proceedings of the Annual Genetic and Evolutionary Computation Conference (GECCO'05),
pages 1161-1167. ACM Press, 2005.

[Weg01] Ingo Wegener. Theoretical aspects of evolutionary algorithms. In Fernando Orejas,
Paul G. Spirakis, and Jan van Leeuwen, editors, Proc. of the 28th International
Colloquium on Automata, Languages and Programming (ICALP'01), volume 2076
of Lecture Notes in Computer Science, pages 64-78. Springer, 2001.

[Weg02] Ingo Wegener. Methods for the analysis of evolutionary algorithms on pseudo-Boolean
functions. In Ruhul Sarker, Masoud Mohammadian, and Xin Yao, editors,
Evolutionary Optimization, pages 349-369. Kluwer, 2002.

[Wit14] Carsten Witt. Fitness levels with tail bounds for the analysis of randomized search
heuristics. Information Processing Letters, 114:38-41, 2014.

